Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses

Authors

  • Xavier Sevillano
  • Germán Cobo
  • Francesc Alías
  • Joan Claudi Socoró
Abstract

Deriving a thematically meaningful partition of an unlabeled document corpus is a challenging task. In this context, document representations based on latent thematic generative models can improve clustering. However, determining the optimal document indexing technique a priori is not straightforward, as it depends on the clustering problem faced and the partitioning strategy adopted. To overcome this indeterminacy, we propose deriving a single consensus labeling from the results of clustering processes executed on several document representations. Experiments conducted on subsets of two standard text corpora evaluate distinct clustering strategies based on latent thematic spaces and highlight the usefulness of consensus clustering in overcoming the indeterminacy regarding optimal document indexing.
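
The abstract does not say which consensus function the authors use, so the following is only a minimal sketch of the general idea, assuming a simple majority-vote (co-association) scheme: two documents end up in the same consensus cluster if most of the individual clusterings put them together. The function name `consensus_labels` and the three example labelings are hypothetical and stand in for clusterings obtained on different document representations.

```python
def consensus_labels(labelings):
    """Merge several clusterings of the same documents into one labeling.

    Assumption (not from the paper): documents i and j are linked when a
    strict majority of the input labelings place them in the same cluster;
    consensus clusters are the connected components of those links.
    """
    n = len(labelings[0])
    m = len(labelings)
    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        # Follow parent pointers to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            # Count how many labelings agree that i and j belong together.
            votes = sum(1 for labels in labelings if labels[i] == labels[j])
            if 2 * votes > m:
                parent[find(i)] = find(j)

    # Rename component roots to 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Hypothetical labelings of six documents from three representations.
# Cluster names are arbitrary, so the first two describe the same partition.
labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [1, 1, 1, 0, 0, 0]
labels_c = [0, 0, 1, 1, 2, 2]
print(consensus_labels([labels_a, labels_b, labels_c]))
```

Comparing pairs of documents rather than raw label values sidesteps the fact that cluster identifiers from different runs are not aligned. Linking by connected components can chain loosely related clusters together, so a real system would likely apply a more robust consensus function.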

Similar resources

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...

Strengths and Weaknesses of Clinical Education Settings from the Viewpoint of Midwifery Students and Educators of Tabriz University of Medical Sciences

Background: Achieving a desirable clinical education requires continuous assessment of the current situations in clinical education and identifying the strengths and weaknesses. This study aimed to assess the strengths and weaknesses of the clinical education fields. Methods: This is a cross-sectional and descriptive study in which the strengths and weaknesses of clinical education settings wer...

High Dimensional Cluster Analysis Using Path Lengths

A hierarchical scheme for clustering data is presented which applies to spaces with a high number of dimensions (N_D > 3). The data set is first reduced to a smaller set of partitions (multi-dimensional bins). Multiple clustering techniques are used, including spectral clustering; however, new techniques are also introduced based on the path length between partitions that are connected to one an...

Co-clustering of bilingual datasets as a mean for assisting the construction of thematic bilingual comparable corpora

In this paper we address the assisted construction of bilingual thematic comparable corpora by means of co-clustering bilingual documents collected from raw sources such as the Web. The proposed approach is based on a quantitative comparability measure and a co-clustering approach which make it possible to mix similarity measures existing in each of the two linguistic spaces with a "thematic" comparabilit...

Chinese Text Summarization Based On Thematic Area Detection

Automatic summarization is an active research area in natural language processing. This paper proposes a method that produces a text summary by detecting thematic areas in a Chinese document. The distinctive feature of the method is that the produced summary can cover many different themes while also clearly reducing redundancy. In this method, the detection of latent themati...

Publication date: 2007